

Search for: All records

Creators/Authors contains: "Mirrokni, Vahab"

Note: Clicking on a Digital Object Identifier (DOI) takes you to an external site maintained by the publisher. Some full-text articles may not yet be available free of charge during the embargo period.

Some links on this page may take you to non-federal websites. Their policies may differ from those of this site.

  1. Free, publicly-accessible full text available July 13, 2026
  2. In the maximum coverage problem we are given d subsets of a universe [n], and the goal is to output k subsets such that their union covers the largest possible number of distinct items. We present the first algorithm for maximum coverage in the turnstile streaming model, where updates that insert or delete an item from a subset arrive one by one. Notably, our algorithm uses only polylog(n) update time. We also present turnstile streaming algorithms for targeted and general fingerprinting for risk management, where the goal is to determine which features pose the greatest re-identification risk in a dataset. As part of our work, we give a result of independent interest: an algorithm to estimate the complement of the p-th frequency moment of a vector for p ≥ 2. Empirical evaluation confirms the practicality of our fingerprinting algorithms, demonstrating a speedup of up to 210x over prior work. (A small offline greedy sketch of the maximum coverage objective appears after this list.)
    Free, publicly-accessible full text available July 13, 2026
  3. Training with noisy labels often yields suboptimal performance, but retraining a model with its own predicted hard labels (binary 1/0 outputs) has been empirically shown to improve accuracy. This paper provides the first theoretical characterization of this phenomenon. In the setting of linearly separable binary classification with randomly corrupted labels, the authors prove that retraining can indeed improve the population accuracy compared to initial training with noisy labels. Retraining also has practical implications for local label differential privacy (DP), where models are trained with noisy labels. The authors propose consensus-based retraining, in which retraining is done only on samples whose predicted label matches the given noisy label. This approach significantly improves DP training accuracy at no additional privacy cost. For example, training ResNet-18 on CIFAR-100 with ε = 3 label DP yields an accuracy improvement of over 6% with consensus-based retraining. (A minimal sketch of consensus-based retraining appears after this list.)
    Free, publicly-accessible full text available May 7, 2026
  4. Agrawal, Shipra; Roth, Aaron (Ed.)
  5. We introduce the ParClusterers Benchmark Suite (PCBS): a collection of highly scalable parallel graph clustering algorithms and benchmarking tools that streamlines comparing different graph clustering algorithms and implementations. The benchmark includes clustering algorithms that target a wide range of modern clustering use cases, including community detection, classification, and dense subgraph mining. The benchmark toolkit makes it easy to run and evaluate multiple instances of different clustering algorithms with respect to both running time and quality. We evaluate the PCBS algorithms empirically and find that they deliver state-of-the-art quality and running time. In terms of running time, they are on average over 4x faster than the fastest library we compared against. In terms of quality, the correlation clustering algorithm [Shi et al., VLDB'21] optimizing for the LambdaCC objective, which has no direct counterpart in other libraries, delivers the highest quality on the majority of the datasets we used. (A generic benchmark-loop sketch appears after this list.)
    Free, publicly-accessible full text available November 1, 2025
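
The following is a minimal, self-contained sketch of the maximum coverage objective from item 2, using the classic offline greedy heuristic. It is not the paper's turnstile streaming algorithm; the example subsets and the budget k are illustrative assumptions.

```python
# Offline greedy sketch of maximum coverage (item 2): pick k of the given
# subsets so that their union covers as many distinct items as possible.
# NOT the paper's turnstile streaming algorithm; illustration only.

def greedy_max_coverage(subsets, k):
    """subsets: list of sets over a universe [n]; k: number of subsets to pick."""
    covered = set()
    chosen = []
    for _ in range(min(k, len(subsets))):
        # Pick the subset that adds the most new (uncovered) items.
        best = max(range(len(subsets)), key=lambda i: len(subsets[i] - covered))
        if not subsets[best] - covered:
            break  # no remaining subset adds new coverage
        chosen.append(best)
        covered |= subsets[best]
    return chosen, covered

# Illustrative example with d = 4 subsets over the universe {0, ..., 9}.
subsets = [{0, 1, 2}, {2, 3, 4, 5}, {5, 6}, {6, 7, 8, 9}]
chosen, covered = greedy_max_coverage(subsets, k=2)
print(chosen, len(covered))  # picks indices [1, 3], covering 8 distinct items
```

Below is a minimal sketch of the consensus-based retraining idea from item 3, assuming a synthetic linearly separable dataset, random label flips, and a least-squares linear classifier as a stand-in for training; it is not the paper's exact procedure or its DP setup.

```python
# Sketch of consensus-based retraining (item 3): retrain only on samples
# where the model's own hard prediction agrees with the given noisy label.
# The linear model and synthetic noisy data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable data with randomly flipped (noisy) labels.
n, d, flip_prob = 2000, 20, 0.3
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y_clean = np.sign(X @ w_star)
flips = rng.random(n) < flip_prob
y_noisy = np.where(flips, -y_clean, y_clean)

def fit_linear(X, y):
    # Least-squares linear classifier as a stand-in for "training".
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Round 1: train on noisy labels, then take hard predictions.
w1 = fit_linear(X, y_noisy)
y_pred = np.sign(X @ w1)

# Round 2 (consensus): keep only samples where prediction == noisy label.
consensus = y_pred == y_noisy
w2 = fit_linear(X[consensus], y_noisy[consensus])

acc = lambda w: np.mean(np.sign(X @ w) == y_clean)
print(f"round 1 accuracy: {acc(w1):.3f}, consensus retraining: {acc(w2):.3f}")
```

Finally, here is a generic sketch of the benchmark workflow described in item 5: run several clustering implementations on the same graph and record both running time and a quality score. It does not use PCBS itself; the networkx community methods and the modularity metric are stand-ins for the suite's algorithms and quality measures.

```python
# Generic clustering-benchmark loop in the spirit of item 5: time several
# implementations and score each clustering with a quality metric.
# NOT the PCBS toolkit; networkx methods and modularity are stand-ins.
import time
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()  # small illustrative graph

algorithms = {
    "louvain": lambda g: community.louvain_communities(g, seed=0),
    "label_propagation": lambda g: list(community.label_propagation_communities(g)),
}

for name, cluster in algorithms.items():
    start = time.perf_counter()
    communities = cluster(G)
    elapsed = time.perf_counter() - start
    quality = community.modularity(G, communities)  # quality score
    print(f"{name}: {elapsed * 1e3:.2f} ms, modularity = {quality:.3f}")
```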
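Each sketch above is a hand-written illustration of the problem or workflow named in the corresponding abstract, under the stated assumptions, and should not be read as the authors' implementation.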
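In particular, the greedy heuristic in the first sketch is only the standard offline baseline for maximum coverage; the streaming, fingerprinting, and privacy results summarized in the abstracts are described in the linked publications.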